
Conversation

@Skn0tt Skn0tt commented Dec 19, 2025

Inspired by #38598, alternative to #38588. Let's play around with it and see how it works for our own CI.



@pavelfeldman pavelfeldman left a comment


LGTM modulo CLI flag.

- type: <[null]|[Object]>
  - `total` <[int]> The total number of shards.
  - `current` <[int]> The index of the shard to execute, one-based.
  - `weights` ?<[Array]<[int]>> The shard weights.

Let's not expose them for now.
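(For context, the fields documented above map onto the existing `shard` option in `playwright.config.ts` roughly as sketched below; `total` and `current` exist today, while `weights` is the addition proposed in this PR and, per the comment above, is not being exposed in config for now.)

```ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Existing shape today: run the first of four shards.
  shard: { total: 4, current: 1 },

  // Proposed addition from this PR's docs (per the comment above, not
  // exposed in config for now): relative weights per shard.
  // shard: { total: 4, current: 1, weights: [26, 24, 25, 25] },
});
```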

## Rebalancing Shards

```
npx playwright test --shard=1/4,26/24/25/25
```

I'd say `--shard=1/4 --shard-weights=26:24:25:25` to keep things readable


@github-actions

Test results for "MCP"

2727 passed, 116 skipped


Merge workflow run.

@Skn0tt Skn0tt merged commit cb93c44 into microsoft:main Dec 23, 2025
30 of 31 checks passed
@github-actions

Test results for "tests 1"

1 failed
❌ [playwright-test] › runner.spec.ts:124 › should ignore subprocess creation error because of SIGINT @macos-latest-node20-2

1 flaky ⚠️ [firefox-library] › library/inspector/cli-codegen-1.spec.ts:1082 › cli codegen › should not throw csp directive violation errors `@firefox-ubuntu-22.04-node20`

34396 passed, 689 skipped


Merge workflow run.


sk33wiff commented Jan 7, 2026

@Skn0tt what release will include these changes?


Skn0tt commented Jan 8, 2026

We're planning for 1.58 to include this.


gpaciga commented Jan 12, 2026

I don't see how this solves the distribution problem raised in the various referenced issues. Consider an unbalanced suite with 1000 tests that take 1s each (~17 minutes total) and two tests that each take 20 minutes, but are unluckily the last tests in the list. I don't think I can use weights to evenly balance this into 2 shards. If the shards are weighted to split up the two 20-minute tests, the first shard will be 37 minutes and the second 20 minutes. If the two are in the same shard, the first shard will be at most 17 minutes and the second will be at least 40 minutes.

So, this is not actually a viable alternative to #38588
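(To make the arithmetic in the comment above concrete, here is a small illustrative sketch; the durations and splits come from the comment, and splitting the ordered test list by counts is an assumption about how weight-based sharding would divide the suite.)

```ts
// Illustrative only: 1000 tests of 1 second each, followed by two 20-minute tests.
const durations: number[] = [...Array(1000).fill(1), 1200, 1200]; // seconds

// Weight-based sharding (as I read the proposal) splits the ordered test list
// by counts; it doesn't know how long each test takes.
function splitByCounts(tests: number[], counts: number[]): number[][] {
  const shards: number[][] = [];
  let offset = 0;
  for (const count of counts) {
    shards.push(tests.slice(offset, offset + count));
    offset += count;
  }
  return shards;
}

const minutes = (shard: number[]) =>
  Math.round(shard.reduce((sum, d) => sum + d, 0) / 60);

// Weights that split the two slow tests apart: ~37 min vs 20 min.
console.log(splitByCounts(durations, [1001, 1]).map(minutes)); // [37, 20]

// Weights that keep them together: ~17 min vs 40 min.
console.log(splitByCounts(durations, [1000, 2]).map(minutes)); // [17, 40]
```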


gerardaz commented Jan 12, 2026

+1 to the above – I'm not convinced this really solves the problem, or if it does, only in a way which doesn't seem very user friendly.

If I'm not mistaken, this just means you can control how many tests are assigned to each shard? So you need to evaluate ahead of time how the tests will bucket, right?

It won't affect the ordering of them. If you have 6 heavy tests in direct sequence (e.g. 10 minutes each) while most of the rest of the 300 tests take seconds, it doesn't seem like there is any conceivable way to come close to balancing them with this implementation, short of having nearly 300 shards.

I'm not sure I understand the scenarios that this is solving for though, so maybe this is my own misunderstanding or lack of imagination.

Doesn't this only really address cases where the tests have a sort of normal distribution of length? So you'd compute (how I'm not sure) where to redraw the boundaries to rebalance them?

My situation, which I gather from @gpaciga's comment above is not unique, is that we have a roughly even distribution of runtimes, with a small number of clustered extreme outliers. If I have 6 outliers in a suite of 300 tests, with 6 shards, the optimal distribution is that none of these six runs in the same shard as any of the others (or close to it). But this weighting wouldn't appear to make that possible – and even if it did, wouldn't new tests distort the distribution?


muhqu commented Jan 12, 2026

@Skn0tt great to see sharding getting some attention!

However, I wonder why this shard-weights approach is considered an alternative to sharding based on duration/timing data?


Skn0tt commented Jan 13, 2026

That's good feedback, thanks. We were thinking that weights are easier to maintain than duration data, but you're right that for uneven scenarios this doesn't solve the ordering problem, whereas something like greedy bin-packing based on duration data would. I'll take this feedback back to the drawing board.
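(For illustration, a minimal sketch of the greedy bin-packing idea mentioned above, i.e. the longest-processing-time heuristic: assign each test, longest first, to the currently least-loaded shard. This is not Playwright's API, just the general technique.)

```ts
interface TestEntry {
  id: string;
  duration: number; // e.g. taken from a previous run's timing data
}

// Sort tests by duration (descending) and always place the next test on the
// shard with the smallest total duration so far.
function binpack(tests: TestEntry[], shardCount: number): TestEntry[][] {
  const shards: TestEntry[][] = Array.from({ length: shardCount }, () => []);
  const loads: number[] = new Array(shardCount).fill(0);
  for (const test of [...tests].sort((a, b) => b.duration - a.duration)) {
    const target = loads.indexOf(Math.min(...loads));
    shards[target].push(test);
    loads[target] += test.duration;
  }
  return shards;
}
```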


muhqu commented Jan 13, 2026

@Skn0tt the shard weights as implemented in this PR help when your shards always run on the same machines and those machines have different performance specs. E.g. shard 1/4 runs on a high-performing machine and can therefore take more weight than the other shards.

However, if you utilize cloud infrastructure to run your tests it's very likely that all shards run on machines that have equal performance specs.

It would be great if you could give each test its own weight and have the sharding automatically distribute tests based on these per-test weights... wdyt?


Skn0tt commented Jan 13, 2026

I see how shard weights can help balance across different machine types, but shards typically all use the same specs. Weights can still help, because sharding is stable. The same test always lands in the same shard, and if a significant number of expensive tests happens to land in one shard, that shard will take longer.

That's what's happening in our own tests: we have roughly 40k tests across different bots and 2 shards, and we found that shard 2 always took longer because its tests are just slightly longer, and that compounds at scale. So attributing more weight to shard 1 allows us to iron out the imbalance, see #38635 for numbers.

This works best in large test suites where the imbalance compounds and the number of tests is stable. It's not as effective in the case outlined above, where a small number of disproportionately expensive tests all land in the same shard.

#38588 is the alternative that solves both use cases, but we were concerned about the usability of committing duration stats into the repo. We'll reevaluate.


muhqu commented Jan 13, 2026

@Skn0tt There is no need to commit the durations to the repository. We're also not doing that. They just serve as a way to optimize overall CI time.

In our setup (still using the changes from #30962, for 2 years already) we keep the duration/timing data in last-run.json, which we also do not check into source control. Our CI archives last-run.json for every main build, and for every build we restore the last successful last-run.json, which gives us an optimal distribution of tests across shards.


Skn0tt commented Jan 13, 2026

I see! What CI runner are you using, and how are you restoring the file across builds?


muhqu commented Jan 13, 2026

We're using Jenkins Pipeline. The Archive step allows us to make the last-run.json available to other pipelines...

However, I'm pretty sure you can use a post-build step in any other CI solution to store the test durations from the last successful main build somewhere you can then use for the next build. E.g. you could even post it as a gist or upload it to an AWS S3 bucket...


gpaciga commented Jan 13, 2026

FWIW I've been using the same approach as @muhqu for approximately 6 months, on GitHub Actions, archiving last-run.json in an S3 bucket. The file is downloaded at the start of each run. The results from the main branch are used the first time a new PR runs, since it won't have its own timing data yet.
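(For anyone wanting to copy this setup, here is a hedged sketch of the restore/archive half using the AWS SDK for JavaScript; the bucket name, object key, and the `test-results/.last-run.json` path are assumptions, and whether that file carries per-test timing data depends on #30962-style changes rather than stock Playwright.)

```ts
import { S3Client, GetObjectCommand, PutObjectCommand } from '@aws-sdk/client-s3';
import { readFile, writeFile } from 'node:fs/promises';

const s3 = new S3Client({});
const Bucket = 'my-ci-artifacts';                 // hypothetical bucket
const Key = 'playwright/last-run.json';           // hypothetical object key
const localPath = 'test-results/.last-run.json';  // assumed output location

// Before the run: restore timings from the last successful main build, if any.
export async function restoreLastRun(): Promise<void> {
  try {
    const res = await s3.send(new GetObjectCommand({ Bucket, Key }));
    await writeFile(localPath, await res.Body!.transformToString());
  } catch {
    // No archived timings yet (e.g. first run): fall back to default sharding.
  }
}

// After a successful main-branch run: archive the timings for future builds.
export async function archiveLastRun(): Promise<void> {
  const body = await readFile(localPath, 'utf8');
  await s3.send(new PutObjectCommand({ Bucket, Key, Body: body }));
}
```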
